Lab Assignment Three: Extending Logistic Regression¶

Created by the Dream Team: Daniel Ryan, Josh Hascall, and Zach Suzuki

Preparation and Overview (3 points total)¶

  • [2 points] Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results. For example, would the model be deployed or used mostly for offline analysis?

The task for classifying is to determine which flight class (eco, eco plus, business) a customer that has filled out the survey for belongs in. The business case for this is to help the airline understand which customers it is best for us to follow up with given their satisfaction with the flight. We are operating under the assumption that the passenger fills out the survey provided but isn't asked to include their class in the survey or future regulations prevent us from asking this question. If the latter scenario occurs, then we would have the data to train and test a logistic regression model.

We think that estimating the class of the respondent is worth training a model on because we can target different types and amounts of advertising content based on this amount. It is our belief that airlines are most interested in retaining and improving relationships with business class clients over economy clients. This is simply because business class clients pay more for their flights; using business jargon, Business class fliers have the highest LTV (lifetime value). Business class tickets are worth orders of magnitude more than economy tickets much of the time, which means the return on investment of advertising to a business class client and getting them to fly with the same airline again is much higher than the retention of an economy customer, even if we must spend more on advertising to acquire and retain these customers. The ultimate goal is to maximize the LTV / CAC (customer acquisition cost) ratio in order to get the most profit. By understanding which customers we are marketing to, we gain better understanding of the potential LTV, and a better idea of how much we can spend to acquire that customer.

  • This model would be used offline. Once we have the dataset of respondents, we can train the model to determine the class for future respondents. The outcome of this will be email/snail mail promotions of different amounts (business class gets highest, eco lowest). This amount can also be adjusted based on the respondents' satisfaction with their flight.
  • How good does it need to be? It does not need to be state of the art, but the expected loss of incorrectly classifying a respondent would be, at most, missing an opportunity to advertise to a business class client, or wasted advertising spent on a poor customer.

We want to note that this is an extremely difficult classification task given the low correlation of these survey responses with class. What we're trying to say is that it is very valuble if we can train a logistic regression algorithm to effectively classify this because it's difficult to do so and we can yield value in higher LTV/CAC in doing so.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from scipy.special import expit
from sklearn.impute import KNNImputer
import copy
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
  • [.5 points] (mostly the same processes as from previous labs) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
    • one hot encoding for categorical; normalize to ease optimization of numerical data
In [2]:
df = pd.read_csv("APS.csv", index_col=0)
df.head()
Out[2]:
id Gender Customer Type Age Type of Travel Class Flight Distance Inflight wifi service Departure/Arrival time convenient Ease of Online booking ... Inflight entertainment On-board service Leg room service Baggage handling Checkin service Inflight service Cleanliness Departure Delay in Minutes Arrival Delay in Minutes satisfaction
0 70172 Male Loyal Customer 13 Personal Travel Eco Plus 460 3 4 3 ... 5 4 3 4 4 5 5 25 18.0 neutral or dissatisfied
1 5047 Male disloyal Customer 25 Business travel Business 235 3 2 3 ... 1 1 5 3 1 4 1 1 6.0 neutral or dissatisfied
2 110028 Female Loyal Customer 26 Business travel Business 1142 2 2 2 ... 5 4 3 4 4 4 5 0 0.0 satisfied
3 24026 Female Loyal Customer 25 Business travel Business 562 2 5 5 ... 2 2 5 3 1 4 2 11 9.0 neutral or dissatisfied
4 119299 Male Loyal Customer 61 Business travel Business 214 3 3 3 ... 3 3 4 4 3 3 3 0 0.0 satisfied

5 rows × 24 columns

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 103904 entries, 0 to 103903
Data columns (total 24 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   id                                 103904 non-null  int64  
 1   Gender                             103904 non-null  object 
 2   Customer Type                      103904 non-null  object 
 3   Age                                103904 non-null  int64  
 4   Type of Travel                     103904 non-null  object 
 5   Class                              103904 non-null  object 
 6   Flight Distance                    103904 non-null  int64  
 7   Inflight wifi service              103904 non-null  int64  
 8   Departure/Arrival time convenient  103904 non-null  int64  
 9   Ease of Online booking             103904 non-null  int64  
 10  Gate location                      103904 non-null  int64  
 11  Food and drink                     103904 non-null  int64  
 12  Online boarding                    103904 non-null  int64  
 13  Seat comfort                       103904 non-null  int64  
 14  Inflight entertainment             103904 non-null  int64  
 15  On-board service                   103904 non-null  int64  
 16  Leg room service                   103904 non-null  int64  
 17  Baggage handling                   103904 non-null  int64  
 18  Checkin service                    103904 non-null  int64  
 19  Inflight service                   103904 non-null  int64  
 20  Cleanliness                        103904 non-null  int64  
 21  Departure Delay in Minutes         103904 non-null  int64  
 22  Arrival Delay in Minutes           103594 non-null  float64
 23  satisfaction                       103904 non-null  object 
dtypes: float64(1), int64(18), object(5)
memory usage: 19.8+ MB
In [4]:
# get object for imputation
knn_obj = KNNImputer(n_neighbors=3)

features_to_use = ['Age','Flight Distance','Departure/Arrival time convenient','Departure Delay in Minutes', 'Arrival Delay in Minutes']

# create a numpy matrix from pandas numeric values to impute
temp = df[features_to_use].to_numpy()

# use sklearn imputation object
knn_obj.fit(temp)
temp_imputed = knn_obj.transform(temp)

# Deep copy
df_imputed = copy.deepcopy(df) # not just an alias
df_imputed[features_to_use] = temp_imputed
df_imputed.dropna(inplace=True)

There are some things to remove from our dataset because they should have no bearing on class prediction. We're going to leave in general reviews of some flight amenities and experiences because there's a chance these improve the likleihood of a correct prediction (ex: A passenger who gave food service 5 stars is probably more likely to be in business class).

1. overall satisfaction
2. Arrival and departure delay
    - This says nothing about which class a person flew in. We will leave in the satisfaction a respondent had with their arrival/delay, as business folk could be more likely to respond a certain way.
3. id
    -  These numbers aren't especially important, since there are no examples of duplicate IDs and in testing and PCA analysis they seem to interfere with our results.
In [5]:
#del df_imputed['Class']
del df_imputed['Departure Delay in Minutes']
del df_imputed['Arrival Delay in Minutes']
del df_imputed['id']
del df_imputed['satisfaction']
In [6]:
# mapping categorical data to integers in order to run correlation
df_imputed['Customer Type'] = df_imputed['Customer Type'].map({'disloyal Customer': 0, 'Loyal Customer': 1})
df_imputed['Gender'] = df_imputed['Gender'].map({'Male': 0, 'Female': 1})
df_imputed['Type of Travel'] = df_imputed['Type of Travel'].map({'Personal Travel': 0, 'Business travel': 1})

# convert Class to ranked integers for easier analysis
df_imputed['Class'] = df_imputed['Class'].map({'Eco': 1, 'Eco Plus': 2, 'Business': 3})
In [7]:
# normalizing Age and Flight distance using a max absolute value scaling
def MaxAbsScaler(col):
    return col / col.abs().max()

# normalization
df_imputed['Age'] = MaxAbsScaler(df_imputed['Age'])
df_imputed['Flight Distance'] = MaxAbsScaler(df_imputed['Flight Distance'])
df_imputed.head()
Out[7]:
Gender Customer Type Age Type of Travel Class Flight Distance Inflight wifi service Departure/Arrival time convenient Ease of Online booking Gate location Food and drink Online boarding Seat comfort Inflight entertainment On-board service Leg room service Baggage handling Checkin service Inflight service Cleanliness
0 0 1 0.152941 0 2 0.092314 3 4.0 3 1 5 3 5 5 4 3 4 4 5 5
1 0 0 0.294118 1 3 0.047160 3 2.0 3 3 1 3 1 1 1 5 3 1 4 1
2 1 1 0.305882 1 3 0.229179 2 2.0 2 2 5 5 5 5 4 3 4 4 4 5
3 1 1 0.294118 1 3 0.112783 2 5.0 5 5 2 2 2 2 2 5 3 1 4 2
4 0 1 0.717647 1 3 0.042946 3 3.0 3 3 4 5 5 3 3 4 4 3 3 3

Just to visualize and discuss, we're going to print the remaining values and discuss correlations between these fields and our passenger class

In [8]:
# print the correlation of all values and sort them from largest to smallest
print('Correlation to Class by percentage')
print((df_imputed.corr()['Class'].sort_values(ascending=False)))

vars_to_use = [
    'Class',
    'Departure/Arrival time convenient',
    'Ease of Online booking',
    'Gate location',
    'Food and drink',
    'Online boarding',
    'Seat comfort',
    'Inflight entertainment',
    'On-board service',
    'Leg room service',
    'Baggage handling',
    'Checkin service',
    'Inflight service',
    'Cleanliness',
]
plt.pcolor(df_imputed[vars_to_use].corr())

plt.yticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.xticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use, rotation=90)
plt.colorbar()
plt.show()
Correlation to Class by percentage
Class                                1.000000
Type of Travel                       0.545257
Flight Distance                      0.451211
Online boarding                      0.322924
Seat comfort                         0.227444
On-board service                     0.209505
Leg room service                     0.204964
Inflight entertainment               0.194366
Baggage handling                     0.160460
Inflight service                     0.156353
Checkin service                      0.151613
Age                                  0.140565
Cleanliness                          0.135818
Ease of Online booking               0.106391
Customer Type                        0.105735
Food and drink                       0.085908
Inflight wifi service                0.036279
Gate location                        0.004150
Gender                              -0.008253
Departure/Arrival time convenient   -0.092788
Name: Class, dtype: float64

Inflight wifi service, Gate location, and Gender all have less than 10% correlation with Class. We're going to remove gate location because we also don't see an intuitive reason to keep it and we're going to remove gender for two reasons:

  1. It doesn't seem to add any value to our classification problem
  2. We don't want to accidentily train gender bias into our model

It's also worth noting again that this classification problem is very difficult; there are fairly low correlations between all attributes and class.

In [9]:
#remove gender and gate location
del df_imputed['Gender']
del df_imputed['Gate location']
In [10]:
#splitting our dataset into X and y
y = df_imputed['Class']
X = df_imputed.drop('Class',axis=1)
In [11]:
df_imputed.groupby('Class')['Age'].count()/len(df_imputed)
Out[11]:
Class
1    0.449886
2    0.072124
3    0.477989
Name: Age, dtype: float64

This is good to know for future reference. It's not uncommon in our testing that our accuracy is exactly equal to one of these class percentages. What this means is our training is really bad and the regression learns to just say "1" every time.

PCA¶

In [12]:
len(X.columns)
Out[12]:
17

Even after removing lots of fields, we still have a TON of dimensions to our dataset. This is a double edged sword; it can make our result more accurate, but it will slow down the performance of our program. What will we do? Well, we're going to produce and test datasets with reduced dimensions and original dimensions. Then, we'll discuss the pros and cons of both.

But how many dimensions should we use for PCA?

In [13]:
# this is a scree plot
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
        

pca = PCA(n_components=17)
pca.fit(X)
plot_explained_variance(pca)

We see the "knee" in this dataset around 8 components. Any more and we might as well use the regular dataset! Any less and we worry we'll lose too much accuracy!

In [30]:
from sklearn.decomposition import PCA

pca = PCA(n_components=8)
pca.fit(X) # fit data and then transform it
X_pca = pca.transform(X)
  • [.5 points] Divide your data into training and testing data using an 80% training and 20% testing split. Use the cross validation modules that are part of scikit-learn. Argue "for" or "against" splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?
    • Just do 80/20, use any package part of scikit-learn
    • argue whether it was a good idea- if not a good idea (small data + lots of classes, etc), still do 80/20 but say what you would theoretically do instead

Rationale¶

  • An 80/20 split seems like a decent split to us, though we'd intuit that a higher split would probably be better for this dataset.
  • We have a relatively large number of instances (100k+), but a very difficult classification problem with quite a few dimensions.
  • We would want to use as much of our data as possible to train the dataset, so we think we could probably make a case for a greater split (90%/10%). 10% of this dataset is still 10k instances
    • in our eyes this is more than enough to know whether the classifier is working. The larger training dataset should improve performance and have minimal impact on reducing testing accuracy or value.
In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

#split the PCA set
X_train_pca, X_test_pca = train_test_split(X_pca, test_size=0.2)

#confirm size of split data
print("Test: ","{:.2%}".format(len(X_test)/len(df_imputed)))
print("Train: ", "{:.2%}".format(len(X_train)/len(df_imputed)),'\n')
print("Test - pca: ","{:.2%}".format(len(X_test_pca)/len(df_imputed)))
print("Train - pca: ", "{:.2%}".format(len(X_train_pca)/len(df_imputed)))
Test:  20.00%
Train:  80.00% 

Test - pca:  20.00%
Train - pca:  80.00%

Modeling (5 points total)¶

  • The implementation of logistic regression must be written only from the examples given to you by the instructor. No credit will be assigned to teams that copy implementations from another source, regardless of if the code is properly cited.
    • reinvent the wheel, dont copy
  • [2 points] Create a custom, one-versus-all logistic regression classifier using numpy and scipy to optimize. Use object oriented conventions identical to scikit-learn. You should start with the template developed by the instructor in the course. You should add the following functionality to the logistic regression classifier:
    • instantiate object, call .fit and .predict; if you fill in any methods/properties, they will end in an underscore
    • Ability to choose optimization technique when class is instantiated: either steepest ascent, stochastic gradient ascent, and {Newton's method/Quasi Newton methods}.
      • customizable into a single class, user only sees one thing, but under the hood is up to you (solver = type, if statement, etc.)
    • Update the gradient calculation to include a customizable regularization term (either using no regularization, L1 regularization, L2 regularization, or both L1 and L2 regularization). Associate a cost with the regularization term, "C", that can be adjusted when the class is instantiated.
      • no reg: no c term; L1: sum of squares; L2: sum of absolute value; or both with C1 & C2
      • find optimal c by trial and error, achieve for ok-good performance, aim for range [1e-4, 1]

if doing both L1 and L2 regularization (regularization=3), C needs to be a tuple (C1, C2). Otherwise C is a single value

choices for solver: SteepestAscent, StochasticGradientAscent, BFGS

In [32]:
class SteepestAscent:
    def __init__(self, eta, iterations=20, C=0.001, regularization=2):
        self.eta = eta
        self.iters = iterations
        self.reg = regularization

        self.C1 = self.C2 = 0
        if regularization==1:
            self.C1 = C
        elif regularization==2:
            self.C2 = C
        elif regularization==3:
            self.C1, self.C2 = C
        # store weights internally as self.w_ 
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return'Binary Logistic Regression Object with coefficients:\n' + str(self.w_) # if we trained object
        else:
            return 'Untrained Binary Logistic Regression Object'
    
    # helper, private
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)), X)) # adds bias term [0, X]
    
    @staticmethod
    def _sigmoid(theta):
        # scipy element wise sigmoid function
        return expit(theta) # 1/(1+np.exp(-theta))

    # vectorized gradient w L2 regularization
    def _get_gradient(self, X, y):
        ydiff = y - self.predict_proba(X, add_bias=False).ravel() # y - g(w^T*X)
        gradient = np.mean(X * ydiff[:,np.newaxis], axis = 0) # make ydiff column vector and mult through

        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += (np.sign(self.w_[1:]) * self.C1) + (-2 * self.w_[1:] * self.C2)

        return gradient

    # public
    def predict_proba(self, X, add_bias=True):
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return probability of y=1

    def predict(self, X):
        return (self.predict_proba(X) > 0.5) # return actual prediciton

    def fit(self, X, y):
        Xb = self._add_bias(X)
        num_samples, num_features = Xb.shape

        self.w_ = np.zeros((num_features,1)) # init weight vector to 0s

        # max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)
            self.w_ += gradient*self.eta  # w <- w + eta*gradient
            # add bc maximizing
In [33]:
class StocasticGradientAscent(SteepestAscent):
    def _get_gradient(self, X, y):
        idx = int(np.random.rand()*len(y)) # grab random instance
        ydiff = y.iloc[idx] - self.predict_proba(X[idx], add_bias = False) # ydiff now scalar
        gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and mult through

        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += (np.sign(self.w_[1:]) * self.C1) + (-2 * self.w_[1:] * self.C2)

        
        return gradient
In [34]:
from scipy.optimize import fmin_bfgs
from numpy import ma
class BFGS(SteepestAscent):
    @staticmethod
    def objective_function(w, X, y, C):
        g = expit(X @ w)

        C1, C2 = C
        return -np.sum(ma.log(g[y==1]))-np.sum(ma.log(1-g[y==0])) + C1*sum(np.abs(w)) + C2*sum(w**2) 
        # -np.sum(y*np.log(g) + (1-y) * np.log(1-g))

    @staticmethod
    def objective_gradient(w, X, y, C):
        g = expit(X @ w)
        ydiff = y-g
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
        gradient = gradient.reshape(w.shape)

        C1, C2 = C
        gradient[1:] += (np.sign(w[1:]) * C1) + (-2 * w[1:] * C2)

        return -gradient
    
    def fit(self, X, y):
        Xb = self._add_bias(X)
        num_samples, num_features = Xb.shape

        self.w_ = fmin_bfgs(self.objective_function, # what to optimize
                            np.zeros((num_features, 1)), # starting point
                            fprime=self.objective_gradient, # gradient function
                            args=(Xb, y, (self.C1,self.C2)), # extra args for gradient and objective function
                            gtol=1e-03, # stopping criteria for gradient, |v_k|
                            maxiter=self.iters, # stopping criteria iterations
                            disp=False)
        self.w_ = self.w_.reshape((num_features, 1))
In [ ]:
# solver: SteepestAscent, StochasticGradientAscent, BFGS
# regularization parameter: 0-none, 1-L1, 2-L2, 3-L1&L2(here C needs to be a tuple as (C1, C2))

class LogisticRegression:
    def __init__(self, eta, iterations=20, regularization=2, C=0.001, solver=SteepestAscent):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.solver = solver
        self.reg = regularization
        self.classifiers_ = []
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.unique(y) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = (y==yval) # create a binary problem
            # train the binary classifier for this class

            blr = self.solver(eta=self.eta, iterations=self.iters, C=self.C, regularization=self.reg)
            blr.fit(X,y_binary)

            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T

    def fit_OvO(self,X,y): # my failed attempt at a one-vs-one classifier
        num_samples, num_features = X.shape
        combos = pd.Series(list(it.combinations(np.unique(y), 2)))
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        for i,yval in enumerate(combos): # for each unique value
            y_binary = [c for c in y if c==y_val[0] or c==y_val[1]]
            y_binary = (y_binary == y_val[0])
            X_binary = [c for c in y if c==y_val[0] or c==y_val[1]]

            blr = self.solver(eta=self.eta, iterations=self.iters, C=self.C, regularization=self.reg)
            blr.fit(X_binary,y_binary)

            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for blr in self.classifiers_:
            probs.append(blr.predict_proba(X)) # get probability for each classifier
        
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
  • [1.5 points] Train your classifier to achieve good generalization performance. That is, adjust the optimization technique and the value of the regularization term(s) "C" to achieve the best performance on your test set. Visualize the performance of the classifier versus the parameters you investigated. Is your method of selecting parameters justified? That is, do you think there is any "data snooping" involved with this method of selecting parameters?
    • aim for scikit-learn to determine 'good' performance
    • for loop, grid search of different values (below 20 models) and say which ones perform well
      • GD, SGD, NM with values 0.001, 0.01, 0.1 for example
    • any data snooping? (yes!, its when you look at your data too much, you keep training and testing on the same data so test data is leaking into the model to produce an artificially high accuracy)

Let's start with SKLearn to get an understanding of where the benchmark is. Let's train and test on a standard SKL liblinear solver

In [37]:
%%time
from sklearn.linear_model import LogisticRegression as SKLogisticRegression

lr_sk = SKLogisticRegression(solver='liblinear') # all params default

lr_sk.fit(X_train,y_train)
yhat = lr_sk.predict(X_train)
print('Training accuracy of: ',accuracy_score(y_train,yhat))

y_hat_test = lr_sk.predict(X_test)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.7799766610925977
Testing accuracy of:  0.7787401953707713
CPU times: user 1.77 s, sys: 95.8 ms, total: 1.87 s
Wall time: 1.85 s

~78% accuracy in under 2 seconds! Not bad... Can we do better?

In [39]:
%%time
lr = LogisticRegression(eta=1,
                        iterations=250,
                        regularization=0,
                        C=0.01,
                        solver=StocasticGradientAscent
                        )
lr.fit(X_train,y_train)
yhat = lr.predict(X_train)
print('Training accuracy of: ',accuracy_score(y_train,yhat))

y_hat_test = lr.predict(X_test)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.4533642914716745
Testing accuracy of:  0.4552235214859728
CPU times: user 142 ms, sys: 184 ms, total: 325 ms
Wall time: 260 ms

Not a great accuracy... Although this is a pretty rudimentary run, and it was fast! Let's change everything and see what our optimal run is!

Warning!:¶

Running this chunk will take a LONG time. As soon as it stops running, I'll let you know!

In [ ]:
import time
import warnings
warnings.filterwarnings("ignore")

def robust_test(X_train,y_train,X_test,y_test):
    data = []
    for regularization in [0,1,2]:
        for eta in range(5):
            eta = 1 / 10**eta
            for iterations in [10,20,50]:
                for C in range(5):
                    C = 1 / 10**C
                    for solver in [StocasticGradientAscent]:
                        startTime = time.time()
                        lr = LogisticRegression(eta=eta,iterations=iterations,regularization=regularization,
                            C=C,solver=solver)
                        lr.fit(X_train,y_train)
                        yhat = lr.predict(X_train)
                        y_hat_test = lr.predict(X_test)
                        run_time = (time.time() - startTime)
                        
                        data.append({'reg' : regularization, 'eta' : eta, 'iterations' : iterations, 'C' : C, 'solver' :
                        solver, 'train_accuracy' : accuracy_score(y_train,yhat), 'test_accuracy' : accuracy_score(y_test,y_hat_test), 'run_time' : run_time})
    return data
    
data = robust_test(X_train,y_train,X_test,y_test)
df_data = pd.DataFrame(data)
df_data.to_csv("run_out2.csv")

data_pca = robust_test(X_train_pca,y_train,X_test_pca,y_test)
df_data_pca = pd.DataFrame(data_pca)
df_data_pca.to_csv("run_out_pca2.csv")
  • [1.5 points] Compare the performance of your "best" logistic regression optimization procedure to the procedure used in scikit-learn. Visualize the performance differences in terms of training time and classification performance. Discuss the results.
In [149]:
df_tests.sort_values(by='test_accuracy',ascending=False)[:1]
Out[149]:
Unnamed: 0 reg eta iterations C solver train_accuracy test_accuracy run_time
268 268 1 0.001 50 0.0001 BFGS 0.779808 0.785092 20.802865

Here is our best performance logistic regression optimization procedure. Since we plan to run this offline, we're going to turn up iterations to a higher number. This will kill our time but should help accuracy (we just want to beat SKLearn at something!)

In [150]:
%%time
warnings.filterwarnings("ignore")
lr = LogisticRegression(eta=0.001,
                        iterations=250,
                        regularization=1,
                        C=0.0001,
                        solver=BFGS
                        )
lr.fit(X_train,y_train)
yhat = lr.predict(X_train)
print('Training accuracy of: ',accuracy_score(y_train,yhat))

y_hat_test = lr.predict(X_test)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.7800488432804399
Testing accuracy of:  0.7790770415283191
CPU times: user 9.28 s, sys: 11.2 s, total: 20.5 s
Wall time: 20.5 s
In [139]:
%%time
lr_sk = SKLogisticRegression(solver='liblinear') # all params default

lr_sk.fit(X_train,y_train)
yhat = lr_sk.predict(X_train)
print('Training accuracy of: ',accuracy_score(y_train,yhat))

y_hat_test = lr_sk.predict(X_test)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.7799766610925977
Testing accuracy of:  0.7787401953707713
CPU times: user 1.74 s, sys: 104 ms, total: 1.85 s
Wall time: 1.84 s

And testing the same method on PCA

In [154]:
%%time
lr.fit(X_train_pca,y_train)
yhat = lr.predict(X_train_pca)
print('Training accuracy of: ',accuracy_score(y_train,yhat))

y_hat_test = lr.predict(X_test_pca)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.47851978393465106
Testing accuracy of:  0.4752899282998893
CPU times: user 978 ms, sys: 1.25 s, total: 2.23 s
Wall time: 2.17 s
In [153]:
%%time
lr_sk.fit(X_train_pca,y_train)
yhat = lr_sk.predict(X_train_pca)
print('Training accuracy of: ',accuracy_score(y_train,yhat))

y_hat_test = lr_sk.predict(X_test_pca)
print('Testing accuracy of: ',accuracy_score(y_test,y_hat_test))
Training accuracy of:  0.47851978393465106
Testing accuracy of:  0.47562677445743706
CPU times: user 313 ms, sys: 55.9 ms, total: 369 ms
Wall time: 300 ms

Ok. Let us discuss these results. We will have in-depth visualizations of the various inputs of our model and their performance in the exceptional work section (we're splitting it up!)

First, we're quite pumped that we beat SKLearn in accuracy by the tiniest margin! However, the time it took to run is unfortunately very high. We will discuss this in greater depth in deployment, but the marginal benefit in accuracy of running hundreds of iterations and using BFGS (which is very time consuming), may be outweighed by the substantial time it took to run. This all depends on the use case.

This also being said, we trained and tested over 100k instances in less than half a minute. While this may be a longer run time than SKLearn, it is a relatively fast run time given the datasets probably won't reach beyond a million at the most.

Deployment (1 point total)¶

  • Which implementation of logistic regression would you advise be used in a deployed machine learning model, your implementation or scikit-learn (or other third party)? Why?
    • after comparing, this is where you ask who cares? is training time important or or can you ignore in deployment? discuss performance and how it will influence third party

Deployment Decision!¶

Airlines¶

Shockingly, we believe that the DREAM TEAM's implementation, using the optimal values we derived above, would be better than SKLearn if it were deployed as a machine learning model in this specific business use case.

Reasoning: You may be saying, "ARE YOU INSANE?" Yes, but we actually have a simple rationale for our reccomendation in this use case. In our classification problem, accuracy is the most important factor. If we are able to correctly assess which passenger flies in which class, we can spend a difference of tens to hundreds of dollars and earn much more from their business (See LTV/CAC discussion above).

Because our model is being trained offline, and it can be trained on 100k+ instances in the amount of time it would take to write a short business memo, we believe training time is negligable and can be ignored for this deployment. Therefore, we advise this is deployed over Scikit-learn.

General ML problems¶

We believe there is some balance between speed and accuracy that our implementation is not finding. While we are sad to say so, a general SKLearn implemetaion is probably a better balance for most problems and it is only nominally inferior from an accuracy standpoint to our implementation.

PCA discussion?¶

When discussing whether or not we should use PCA, we must consider the tradeoffs between speed and accuracy. Unfortunately, while we thought we had found the elusive "knee" in our PCA analysis, our accuracy was extremely subpar. It is possible we found the knee, but a greater number of dimensions is Unfortunately required to run linear regression to a satisfactory level on this dataset.

THIS BEING SAID... We were very surprised by the impressive difference in time between training and testing PCA and the full dataset (which we will do in exceptional work). If the client believes speed is an important consideration, then a PCA implementation with more dimensions could be a perfect balance and still superior to SKL.

In [ ]:
 

Exceptional Work (1 point total)¶

For the exceptional work we attempted to do a one-vs-one extension, and the code can be seen in the logistic regression methods, however, we could not finalize it in order for it to work correctly in time. Instead, we hope our extensive analysis, discussion of PCA implementations and the robust testing with lots of graphs (below) comprises exceptional work.

You can see this implementation in: fit_OvO

Method Analysis¶

In [14]:
#import robust runs
df_tests = pd.read_csv("run_out_robust.csv")
df_tests_pca = pd.read_csv("run_out_robust_pca.csv")
In [113]:
width = 0.3
names = ['BFGS','SA','SGA']
axis = np.arange(3)
plt.bar(axis-0.15, df_tests.groupby('solver').mean()['test_accuracy'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('solver').mean()['test_accuracy'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Test Accuracy')
plt.xlabel('method')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
plt.clf()
plt.bar(axis-0.15, df_tests.groupby('solver').mean()['run_time'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('solver').mean()['run_time'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Run Time')
plt.xlabel('method')
plt.ylabel('Average Time in Seconds')
plt.legend()
Out[113]:
<matplotlib.legend.Legend at 0x7f266062e130>

The above charts show a couple crucial things about our implementation. First, BFGS is vastly superior in accuracy to the other two methods. In fact, SA and SGA are both below 50% accuracy on average. This being said, the mean may not be the best value to look at when considering we want the absolute best considering the parameters we choose. Let's examine a boxplot of the means and the values for these with maxes

In [135]:
fig, ax = plt.subplots(figsize=(10,8))
df_tests.groupby('solver').boxplot(column=['train_accuracy','test_accuracy'],ax=ax)
Out[135]:
BFGS         AxesSubplot(0.1,0.559091;0.363636x0.340909)
SA      AxesSubplot(0.536364,0.559091;0.363636x0.340909)
SGA              AxesSubplot(0.1,0.15;0.363636x0.340909)
dtype: object
In [136]:
fig, ax = plt.subplots(figsize=(10,8))
df_tests.groupby('solver').boxplot(column=['run_time'],ax=ax)
Out[136]:
BFGS         AxesSubplot(0.1,0.559091;0.363636x0.340909)
SA      AxesSubplot(0.536364,0.559091;0.363636x0.340909)
SGA              AxesSubplot(0.1,0.15;0.363636x0.340909)
dtype: object

BFGS has very few outliers and relatively stable performance in terms of accuracy compared to the other two. At the same time, the run time distribution is both much higher than for the other two and quite a wide range. It does appear to us that Steepest Ascent could be used if given the right parameters, if run time is a larger factor than accuracy.

In [15]:
width = 0.3
names = [10,20,50]
axis = np.arange(3)
plt.bar(axis-0.15, df_tests.groupby('iterations').mean()['test_accuracy'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('iterations').mean()['test_accuracy'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Test Accuracy')
plt.xlabel('iterations')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
plt.clf()
plt.bar(axis-0.15, df_tests.groupby('iterations').mean()['run_time'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('iterations').mean()['run_time'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Run Time')
plt.xlabel('iterations')
plt.ylabel('Average Time in Seconds')
plt.legend()
Out[15]:
<matplotlib.legend.Legend at 0x2259d854e80>

It comes as minimal surprise that the more iterations that are run, the better the performance and the longer it takes to run. This seems to be true asympotically in terms of accuracy and linearly with run time.

In [120]:
width = 0.3
names = ['0.0001','0.001','0.01','0.1','1']
axis = np.arange(5)
plt.bar(axis-0.15, df_tests.groupby('C').mean()['test_accuracy'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('C').mean()['test_accuracy'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Test Accuracy')
plt.xlabel('C value')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
plt.clf()
plt.bar(axis-0.15, df_tests.groupby('C').mean()['run_time'], width=width,label='Full')
plt.bar(axis+0.15, df_tests_pca.groupby('C').mean()['run_time'], width=width,label='PCA')
plt.xticks(axis,names)
plt.title('Average Run Time')
plt.xlabel('C value')
plt.ylabel('Average Time in Seconds')
plt.legend()
Out[120]:
<matplotlib.legend.Legend at 0x7f26605103a0>

C value has minimal bearing on accuracy except for C = 0.1, which is slower AND less accurate than its peers. It does seem that the smallest C value has high average accuracy and minimizes run time.